[57] ABSTRACT

A method and apparatus for extracting multi-word technical terms
from a text file in a computer system. Word strings are selected
from the text that have at least two words, that have at most a
specified maximum number of words, that include none of a special
set of selected tokens, and that only include selected characters.
Word strings which occur less than a specified minimum number of
times in the text file are deleted. The remaining strings form a set
of word strings very likely to be multi-word technical terms.
Improvements on the quality of the set of word strings can be
accomplished by deleting word strings which do not satisfy certain
grammatical constraints.

28 Claims, 2 Drawing Sheets

METHOD FOR EXTRACTING MULTI-WORD TECHNICAL TERMS FROM TEXT

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation-in-part of co-pending application
Ser. No. 07/785,641 filed Oct. 31, 1991, the priority of which is
retained.

DESCRIPTION

Technical Field

This invention relates to a method and apparatus for extracting
multi-word technical terms from a text file in a computer system.
More specifically, a set of word strings satisfying specific
constraints is created which consists primarily of multi-word
technical terms.

BACKGROUND OF THE INVENTION

Automatic extraction of terminology from text is important for a
variety of activities that involve processing of natural language. It
is an especially pressing need for the writers and translators of
technical manuals, for whom it can help maintain the consistency and
correctness of translation and usage and decrease the cost of their
activities.

Among the uses to which the identification of technical terminology
can be put are the following:

  The identification of terms requiring translations in a bilingual
    terminology dictionary for humans or in the automated
    dictionaries of a computerized natural language translation
    system.
  The identification of new terms requiring definition in a glossary
    of a technical document or in a dictionary.
  The identification of terms in a text document which should be used
    for indexing that document in a computerized information
    retrieval system.
  The identification of domain-specific concepts in a domain for use
    in a knowledge-representation system that models that domain.
  The identification of additional entries for lexicons for natural
    language parsers in order to improve the performance of those
    parsers for a variety of applications.
  The identification of terms to be used in algorithms for
    determining the topic of a text document.

A technical term is a word string that has a particular meaning in a
domain. A multi-word technical term is a term that consists of more
than one word. A technical term can be a common noun phrase such as
"central processing unit" or "market share". It may also be a proper
noun phrase such as "United States Patent Office" [illegible].

U.S. Pat. No. 4,566,295 to K. Toth describes an improvement in
stenographic systems using word frequency [illegible].

U.S. Pat. No. 4,625,295 to J. T. Skinner describes a hardware means
for locating predefined characters, words, or combinations of words.

U.S. Pat. No. 4,744,050 to Hirosawa et al describes a method of
determining the most frequently used phrases in a text, while this
invention is concerned with noun phrases that occur more than once in
a text.

U.S. Pat. No. 4,813,010 to T. Okamoto et al deals with the extraction
of hierarchical structure in a document as indicated by section
headings. Word, phrase, and symbol frequencies are used, with higher
frequency forms being preferred candidates for inclusion in headings.
Okamoto uses actual frequency information rather than simple
repetition information as with the invention [illegible] application.
Simple repetition information is only information as to whether a
word string appears a minimum number of times in a text file.

U.S. Pat. No. 4,868,750 to Kucera et al describes a means for
determining grammatical tags for sequences of words in text. Kucera
is concerned with a method of grammatical tagging, while the
applicants' invention merely uses grammatical tagging. Applicants'
invention could use the grammatical tagging of Kucera; however, his
tagging method is not preferred.

U.S. Pat. No. 4,888,730 to McRae et al teaches replacing frequently
used words with their synonyms.

SUMMARY OF THE INVENTION

It is therefore an object of this invention to provide a method and
apparatus for extracting [illegible] terms from text with
[illegible].

It is [illegible] object of this invention to automatically
[illegible] multi-word technical terms from a text file in a computer
system with a high rate of coverage and high quality. Coverage refers
to the percentage of actual multi-word technical terms that are
extracted from a given file while quality refers to the percentage of
extracted strings that are actually multi-word technical terms.

Accordingly, this invention provides a method and apparatus for
extracting multi-word technical terms in a text file in a computer
system. With this invention word strings are selected from a text
file with each word string having at least two words and at most a
specified maximum number of words. The word strings are also
[illegible] following [illegible] of constraints: frequency, token,
and grammar.

The frequency constraint refers to the number of times that a word
string appears in a file. The token constraint refers to the property
that a word string not have a word which belongs to a selected set of
words. This selected set need not be a finite list of words, but can
include tokens which have characters other than members of a selected
set of allowable characters. Finally, satisfaction of grammar
constraints would require selected potential parts of speech to
appear in selected positions of a word string. For example, one such
grammatical constraint would be to require a word string end with a
potential noun.

The quality of the resulting list of word strings can be further
improved by requiring not only one type of constraint, but two or all
three of the constraints to be satisfied. Specifically, the word
strings are intended to [illegible] of adjectives (other than
quantifiers or determiners) and nouns (other than pronouns or
quantifiers used as nouns) and that end in a noun, or that consist of
two such noun phrases joined by a preposition.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 schematically illustrates the various components of the
invention with the text file as the input and the candidate
multi-word technical terms as the output.

FIG. 2 schematically illustrates the steps necessary to extract a
list of candidate multi-word technical terms from a text file.
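The summary defines coverage as the share of actual multi-word technical terms that are extracted, and quality as the share of extracted strings that are actual terms. A minimal Python sketch of those two ratios follows. The function name and the sample term lists are hypothetical and chosen only for illustration; the patent does not prescribe how a reference list of actual terms is obtained.

```python
def coverage_and_quality(extracted, actual_terms):
    """Return (coverage, quality) as fractions between 0 and 1.

    coverage: fraction of the actual terms that were extracted.
    quality:  fraction of the extracted strings that are actual terms.
    """
    extracted = set(extracted)
    actual_terms = set(actual_terms)
    hits = extracted & actual_terms
    coverage = len(hits) / len(actual_terms) if actual_terms else 0.0
    quality = len(hits) / len(extracted) if extracted else 0.0
    return coverage, quality


# Hypothetical data: three extracted strings, four true terms, two in common.
extracted = ["central processing unit", "market share", "program contained"]
actual = ["central processing unit", "market share", "text file", "word string"]
coverage, quality = coverage_and_quality(extracted, actual)
print(coverage, quality)
```

With these sample lists, two of the four actual terms are found and two of the three extracted strings are correct, which illustrates how a filter can trade one ratio against the other.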






5,423,032 

DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 shows the invention to be an apparatus (10) which accepts, as
input, a text file (13) and produces, as output, a list of candidate
technical terms (18). The operation of the apparatus is controlled by
a set of parameters, including a stoplist (11), a list of allowable
characters (14) and a dictionary containing words' parts of speech
(12). The apparatus is made up of sub-modules which include a text
fragment extractor (19), a word string extractor (15), a frequency
checker (16), and a grammatical constraint checker (17).

FIG. 2 shows the steps involved in extracting a list of candidate
terms (18) from a text file (13), in the preferred embodiment. The
text fragment extractor (19) produces text fragments (20) which are
passed through a word string extractor (15), to produce a list of
word strings (21). The text fragment extractor produces fragments
that are the maximal strings of consecutive words from the text file
that contain only allowable characters (14) and that contain no words
from the stoplist (11). The text fragments are passed to the word
string extractor one at a time, as said text fragments are produced
from the text file. Word strings are produced from these fragments by
selecting all substrings of at least 2 and at most a specified
maximum number of words whose last word has a noun entry in the
dictionary of parts of speech (12). The word strings produced from
these fragments are accumulated in the word string list (21), and
control passes back to the text fragment extractor (19) until the
text file is exhausted. The word string list (21) is then passed to
the frequency filter (16), which produces an intermediate list of
word strings (22) which the grammatical constraints checker (17)
converts into the final list of candidate terms (18) by eliminating
those word strings that do not satisfy the grammatical constraints.
The various filters and checkers may be applied in sequences
different than the one shown here. These filters and checkers may
also be applied simultaneously, thus reducing the number of
intermediate word string lists.

The basic idea behind this invention is to extract a set of word
strings that is very likely to contain a high proportion of the
multi-word technical terms in a text file in a computer system.

PSEUDO-CODE FOR PREFERRED EMBODIMENT

Input to the program consists of a text file text_file. Parameters
are max_length, the maximum number of words permitted in a candidate
term; min_count, the minimum frequency a candidate word string must
have to be accepted as a candidate term; allow_preposition, which is
true if prepositions are to be allowed in candidate terms, and which
is false otherwise; lemmatize, which is true if all candidate word
strings are to be normalized to singular form, false if singular and
plurals are both to be listed; and no_spec_chars which is true if
non-allowable characters are to be prohibited in candidate terms and
false otherwise. The default values for these parameters are 5 for
max_length, 2 for min_count, true for allow_preposition, false for
lemmatize, and true for no_spec_chars. The notation s[n,l] indicates
a substring of the string s, consisting of l consecutive words
beginning at the nth word of s.

The main procedure is named terms. This program calls three
procedures for which pseudo-code is provided: substrings,
frequency_constraint and grammar_constraints.

Other functions called by these procedures, and assumed to be part of
the computing environment, are: length, the number of characters in a
string; member, which returns true if the first argument occurs in a
list, which list is its second argument, and false otherwise;
only_contains, which returns true if all characters in a string which
is the first argument are contained in a list of characters which is
the second argument, and false otherwise; concatenate, which creates
a string by joining the two argument strings with a blank character
between them; min, which returns the smallest of its arguments; max,
which returns the largest of its arguments; part_of_speech, which
returns a list of all the parts of speech listed for its argument in
the part-of-speech dictionary, and noun if the argument is not in the
dictionary; singular, which returns a singular form for a noun
argument; sort, which arranges the elements of a list in alphabetical
order; next_record, which returns the next record of the list; and
tail, which returns the argument string but with its first token
removed.
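The fragment extraction and word string extraction described above can be sketched in Python as follows. This is an illustrative reading of the description, not the patent's own code: the helper names (`sub`, `extract_fragments`, `word_strings`), the toy part-of-speech table and the three-word delimiter set are assumptions made for the example. Following the description, every substring of 2 to `max_length` words that ends in a potential noun is produced, and a word missing from the dictionary is treated as a noun.

```python
def sub(words, n, l):
    """The s[n, l] notation: l consecutive words starting at the nth word (1-based)."""
    return words[n - 1:n - 1 + l]


def extract_fragments(tokens, is_delimiter):
    """Split tokens into maximal runs that contain no delimiting token."""
    fragments, current = [], []
    for tok in tokens:
        if is_delimiter(tok):
            if len(current) > 1:          # one-word fragments cannot yield terms
                fragments.append(current)
            current = []
        else:
            current.append(tok)
    if len(current) > 1:
        fragments.append(current)
    return fragments


def word_strings(fragment, max_length, is_potential_noun):
    """All substrings of 2..max_length words whose last word is a potential noun."""
    out = []
    for end in range(1, len(fragment)):   # 0-based index of the last word
        if not is_potential_noun(fragment[end]):
            continue
        first_start = max(0, end - max_length + 1)
        for start in range(first_start, end):
            out.append(" ".join(fragment[start:end + 1]))
    return out


# Toy part-of-speech dictionary (hypothetical); unknown words count as nouns.
TOY_POS = {"central": {"adj"}, "processing": {"noun"},
           "unit": {"noun"}, "fast": {"adj"}}


def is_potential_noun(word):
    return "noun" in TOY_POS.get(word, {"noun"})


tokens = "the central processing unit is a fast unit".split()
fragments = extract_fragments(tokens, lambda t: t in {"the", "is", "a"})
strings = [s for f in fragments for s in word_strings(f, 5, is_potential_noun)]
print(fragments)
print(strings)
```

Because stoplist words act as delimiters, the eight-token sentence yields only two short fragments, so far fewer substrings have to be counted and checked later.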



begin terms(max_length, min_count, allow_preposition, lemmatize, no_spec_chars):
    if max_length = '' then max_length := 5
    if min_count = '' then min_count := 2
    if allow_preposition = '' then allow_preposition := true
    if lemmatize = '' then lemmatize := false
    if no_spec_chars = '' then no_spec_chars := true
    word_string_list := ''
    s := text_file
    current_fragment := ''
    for i from 1 to length(s) {
        w := s[i, 1]
        if member(w, stoplist) | (no_spec_chars & not only_contains(w, allowable_chars)) then {
            if length(current_fragment) > 1 then substrings(current_fragment, max_length, lemmatize)
            current_fragment := ''
        } endif
        else current_fragment := concatenate(current_fragment, w)
    } end for loop
    if length(current_fragment) > 1 then substrings(current_fragment, max_length, lemmatize)
    word_string_list := frequency_constraint(word_string_list, min_count)
    candidate_list := grammar_constraints(word_string_list, allow_preposition)
end terms

begin substrings(f, max_length, lemmatize):
    for l from 2 to min(max_length, length(f)) {
        w := f[l, 1]
        if not member('noun', part_of_speech(w)) then iterate
        if lemmatize then {
            w := singular(w)
            f := concatenate(f[1, l-1], w)
        } end of if lemmatize
        for i from max(1, l - max_length + 1) to l-1 {
            write f[i, l-i+1] to word_string_list
        } end of inner for loop
    } end of outer for loop
end substrings

begin frequency_constraint(list, min_count):
    list := sort(list)
    old_r := ''
    count := 0
    out_list := ''
    while (list not empty) {
        r := next_record(list)
        if r = old_r then count := count + 1
        else {
            if count >= min_count then write concatenate(count, old_r) to out_list
            count := 1
            old_r := r
        } end of else
    } end of while loop
    return(out_list)
end frequency_constraint

begin grammar_constraints(list, allow_preposition, lemmatize):
    while (list not empty) {
        r := next_record(list)
        s := tail(r)    / remove string's frequency from front of string /
        l := length(s)
        no_preposition := allow_preposition
        constraints_satisfied := true
        for i from 1 to l-1 while (constraints_satisfied) {
            w := s[i, 1]
            if member('noun', part_of_speech(w)) then iterate
            if member('adj', part_of_speech(w)) then iterate
            if i > 1 & no_preposition & member('prep', part_of_speech(w)) then {
                no_preposition := false
                w := s[i-1, 1]
                if member('noun', part_of_speech(w)) then iterate
            } end of if
            constraints_satisfied := false
        } end of for loop
        if constraints_satisfied then write r to out_list
    } end of while loop
    return(out_list)
end grammar_constraints
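As a rough companion to the pseudo-code, the two filtering stages can be sketched in Python. This is an approximation under stated assumptions rather than a transcription: `collections.Counter` stands in for the sort-and-scan counting loop, the part-of-speech table `TOY_POS` and the sample word strings are invented for the example, and a word missing from the table is treated as a noun, as the description specifies.

```python
from collections import Counter


def frequency_constraint(word_string_list, min_count=2):
    """Keep word strings that occur at least min_count times, with their counts."""
    counts = Counter(word_string_list)
    return {s: c for s, c in counts.items() if c >= min_count}


def satisfies_grammar(words, pos_of, allow_preposition=True):
    """Check every word except the last: noun or adjective, or at most one
    preposition that is not first and directly follows a potential noun."""
    prep_available = allow_preposition
    for i, w in enumerate(words[:-1]):
        tags = pos_of(w)
        if "noun" in tags or "adj" in tags:
            continue
        if i > 0 and prep_available and "prep" in tags:
            prep_available = False
            if "noun" in pos_of(words[i - 1]):
                continue
        return False
    return True


# Toy dictionary (hypothetical); words not listed default to noun.
TOY_POS = {"rate": {"noun"}, "coverage": {"noun"},
           "high": {"adj"}, "of": {"prep"}}


def pos_of(word):
    return TOY_POS.get(word, {"noun"})


word_string_list = ["rate of coverage", "rate of coverage", "high rate",
                    "of coverage", "of coverage"]
kept = frequency_constraint(word_string_list, min_count=2)
candidates = [s for s in kept if satisfies_grammar(s.split(), pos_of)]
print(kept)
print(candidates)
```

In this sample, "high rate" is dropped for occurring only once, and "of coverage" is dropped because a preposition may not open a candidate, which leaves "rate of coverage".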
stoplist:

/ quantifier and determiner adjectives and nouns /
a all an another any both each either enough entire every following
her his its my one's only our own preceding previous same several
some sometime subsequent such that the your
0 1 2 3 4 5 6 7 8 9 one two three four five six seven eight nine ten
eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen
twenty thirty forty fifty sixty seventy eighty ninety hundred thousand million billion trillion

/ frequent words that are only rarely full nouns, full adjectives, or prepositions /
altogether am as away back being but can do does doing done get gets
getting goes going gone got gotten have having here if like liked
likes liking made make makes making may might mine must now once or
otherwise preceding said saw say saying says see seeing sees so somebody
someplace something somewhere still than then there therefore twice
willing

/ frequent tokens that are never full nouns, full adjectives, or prepositions /
».::!?..."00O<>
again and are aren't be been can't cannot could couldn't did
didn't doesn't don't few go had hadn't has hasn't haven't
he he'd he'll he's hers hers him himself himself how I I'd I'll I'm
is isn't it it'd it'll it's itself itself just less many me mightn't more
most much mustn't myself myself no oneself oneself other ours
[illegible] ourself ourselves ourselves shall shan't she she'd
she'll she's should shouldn't someone that'll
that's theirs theirs them themselves themselves therefore they they'd
they'll they're was wasn't we we'd we'll we're went were weren't who
whom won't would wouldn't you you'd you'll you're yours yours
yourselves

/ non-word tokens that do not normally occur in terminology /
b c d e f g h i j k l m n o p q r s t u v w x y z es •— +/

allowable_chars:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'.-
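The stoplist and allowable_chars listings above drive the token test that decides where fragments are cut. A small Python sketch of that test follows. The seven-entry `STOPLIST_SAMPLE` is a hypothetical excerpt chosen for the example, not the full list, and the function name `passes_token_constraint` is likewise an assumption; the `no_spec_chars` flag mirrors the parameter of the same name in the pseudo-code.

```python
import string

# Letters, apostrophe, period and hyphen, as in the allowable_chars listing.
ALLOWABLE_CHARS = set(string.ascii_letters + "'.-")

# Hypothetical excerpt of the stoplist, for illustration only.
STOPLIST_SAMPLE = {"a", "the", "each", "is", "would", "7", "+"}


def passes_token_constraint(token, stoplist=STOPLIST_SAMPLE,
                            allowable=ALLOWABLE_CHARS, no_spec_chars=True):
    """True if the token may appear inside a fragment.

    A token is rejected when it is on the stoplist, or when no_spec_chars
    is set and the token contains a character outside the allowable set.
    """
    if token in stoplist:
        return False
    if no_spec_chars and not set(token) <= allowable:
        return False
    return True


for tok in ["built-in", "user's", "the", "x2", "f(x)"]:
    print(tok, passes_token_constraint(tok))
```

Tokens such as "x2" or "f(x)" are rejected only by the character test, so passing `no_spec_chars=False` would let them through while stoplist words would still be rejected.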
COMMENTS ON THE PREFERRED EMBODIMENT

The invention is intended to provide both high coverage of a text's
technical terminology and high quality of the candidate terms
extracted. Trade-offs between quality and coverage occur in several
ways. The preferred embodiment generally selects coverage over
quality unless quality can be increased with minimal sacrifice in
coverage.

The option of accepting prepositions in the candidate string is a
parameter, allow_preposition in the pseudo-code above; the value of
this parameter defaults to true. Few candidates including
prepositions normally prove to be genuine terminology. Accepting a
preposition in candidate strings provides greater coverage of valid
terms; excluding them provides greater quality among candidates.

A word-by-word part-of-speech filter, grammar_constraints, is used,
rather than a full parser, to recover candidates that, at least out
of context, are consistent with the grammatical constraints. This may
result in the production of candidates that, in the context of their
actual use in the text, are not in fact consistent with the
constraints. For example, "contained procedure" could be an
adjective-noun sequence; this filter would admit it as a candidate,
even if it came from a sentence such as "The program contained
procedure calls", in which it was actually a verb-noun sequence.
However, it also guarantees the highest coverage possible, using a
given part-of-speech database, of noun phrases that do meet the
grammar constraints, and is more efficient than parsing. Nonetheless,
the grammatical constraints could be implemented instead by using an
automatic parser.

The preferred sequential embodiment breaks up the text into
fragments, before checking for repetition of substrings of the
fragments. Fragments are the word strings which satisfy specific
properties specified in the glossary. This is accomplished by
prohibiting the generation of any fragment containing a word which
appears on a stoplist of (mostly) high frequency tokens or which
contains non-allowable characters. All substrings of at least two and
at most max_length words and that end in a potential full noun are
collected from each fragment. (max_length is also a program
parameter, which defaults to 5; empirically, very few technical terms
have more than 5 words.) The stoplist presented in the preferred
implementation includes mainly quantifiers, pronouns, demonstratives,
and auxiliary verbs; among them are most of the 150 most frequent
words in English. Because of the extremely high overall frequency of
these tokens, this yields fragments much shorter than the original
sentences, and substantially fewer substrings to test for repetition
and for agreement with part-of-speech constraints. The list of
allowable characters, allowable_chars, contains upper and lower case
alphabetical characters, apostrophe, hyphen, and period. The values
of stoplist and allowable_chars may be changed by the implementor to
meet particular requirements of the text being processed.

Tokens are on the stoplist for one of four reasons. (1) Enforcing
adjective/noun type: A stoplist provides an efficient way of
implementing the restriction of adjectives to full adjectives,
excluding quantifiers and determiners, and thereby also the
restriction of lexical nouns to full nouns, excluding non-full
adjectives used as nouns. This avoids reference to the categories of
quanti- [text missing] stoplist includes a large number of words that
can be, but seldom are, used as full nouns, full adjectives, or
prepositions. (3) Frequency alone: Some tokens that are not nouns,
adjectives, or prepositions are on the stoplist simply by virtue of
their high frequency. Examples include punctuation and certain
conjunctions, pronouns, adverbs, and verbs. Although fragments
containing such tokens would be eliminated later by the
grammar_constraints routine, processing expense is reduced when they
are on the stoplist since the number of fragments whose collection is
thereby avoided constitutes a substantial proportion of the total
number of fragments that would otherwise be collected and submitted
to the frequency_constraint and grammar_constraints routines.
(4) Non-word tokens that are rare in technical terms. These often
enter text in variable names, or in fragments of equations, computer
programs, and other non-textual data.

Control over minimum frequency is achieved by assigning a value to
the parameter min_count, which defaults to 2. Groups of candidate
terms of lower frequency have lower quality than groups of candidate
terms of higher frequency; and the most frequent candidate strings
recovered from technical text are almost always valid technical
terms. Particularly for longer documents, minimum frequencies greater
than 2 normally will substantially increase quality. However, there
is a definite loss of coverage associated with raising the minimum
frequency parameter min_count. Alternatively, lower frequency strings
may be subjected to further quality constraints than are the higher
frequency candidates. Incidentally, sorting candidate strings by
frequency permits presentation in "best-first" order, in the sense
that the quality of the list is higher toward the beginning than
toward the end.

The program parameter lemmatize defaults to false. Some terms occur
in a text in both singular and plural forms; under the grammatical
constraints of the algorithm, these terms are taken to differ only in
the grammatical number of the final noun of the term. If neither the
singular nor plural forms occur at least min_count times and
lemmatize is false, such terms are not recovered by the algorithm,
even if the singular and plural forms together do satisfy the
frequency constraint. These candidates are retrieved if lemmatize is
passed to the program as true. Processing takes place at the
beginning of the substrings procedure. The final word of each
substring of the text fragment is checked to assure that it is a noun
(thus, part of the grammatical constraints are handled in substrings
and not in grammar_constraints); if so, then that noun is replaced by
its singular form in the substring, and the lemmatized substring is
then written to the output list of substrings. The frequency
constraint and the remaining grammatical constraints are then applied
to an already lemmatized list of strings. If the allow_preposition
parameter is passed as true, this implementation of lemmatization is
a simple and efficient approximation, rather than an exact
implementation. In terms containing prepositions, usually the last
noun preceding the preposition should be lemmatized, whether or not
the last noun in the string is lemmatized. Thus, when two candidate
strings occur with frequency one, and differ from one another only in
that the noun before the preposition is in the singular in one
candidate and in the plural in the other candidate, the lemmatized
form of the string has a frequency of 2 and is a viable candidate
(assuming min_count=2). In the current implementation, however, there
is no check for prepositions until the frequency constraint has been
applied, so the string is not retrieved. The approximation provides
adequate coverage, since the proportion of valid candidates
containing prepositions is relatively low, and only a small
proportion of even these cases will occur exactly once in the
singular and once in the plural. If desired, these cases could be
recovered by implementing lemmatization in a more complex or less
efficient way.

In case non-textual data occurs in the file, performance of the
algorithm will be improved by preprocessing the file to remove such
data, and applying the above procedure to the processed file.

The invention can be implemented on any modern sequential or parallel
computer machinery, including but not limited to personal computers,
workstations, minicomputers, mainframes, and supercomputers.

GLOSSARY

allowable characters - Characters that are allowed to occur in the
    candidate multi-word terms to be extracted.
candidate list - The output of the terminology identification
    mechanism. The candidate list consists of word strings. The
    candidate list is intended to be of high quality and to have high
    coverage.
candidate string, candidate term - A word string in the candidate
    list.
coverage - The proportion of those technical terms in the text which
    are among the candidate terms.
fragment - A fragment is a sequence of tokens in a text file. The
    first token in a fragment is either the first token in the
    sentence or it immediately follows a token on a stoplist, or it
    immediately follows a token containing a character not among the
    allowable characters. The last token of a fragment is either the
    last token of the text file, or it immediately precedes a token
    on a stoplist, or it immediately precedes a token containing a
    character not among the allowable characters. A fragment contains
    no token on a stoplist, and each token in a fragment contains
    only allowable characters. Any technical term in the text must be
    contained entirely within a fragment.
frequency - The number of times a substring/word string appears in a
    file.
frequency constraint - A condition that a substring must appear more
    than some minimum number of times in the input text in order to
    become a candidate term.
full adjective - An adjective which is not a quantifier or
    determiner. Adjectives that are not full adjectives appear on a
    stoplist.
full noun - A noun which is not a pronoun nor a non-full adjective
    used as a noun. Nouns that are not full nouns appear on a
    stoplist.
grammatical constraint - A condition, on substrings, that requires
    selected parts of speech in selected positions. Word strings not
    meeting this condition do not become candidate strings.
high frequency word - A word which occurs frequently in English. Such
    words are unlikely to occur in technical terms.
lemmatize - To extract the morphological base of an inflected word by
    removing inflection affixes or by looking up the base of an
    irregular inflected form. A plural noun is lemmatized by
    producing its singular form.
multi-word technical term - A technical term which consists of two or
    more words.
potential full adjective - An instance of a word, which word is used
    in some contexts as a full adjective, whether or not said
    instance is itself used as an adjective.
potential full noun - An instance of a word, which word is used in
    some contexts as a full noun, whether or not said instance is
    itself used as a noun. Words which do not appear in the
    part-of-speech dictionary are assumed to be potential full nouns.
potential preposition - An instance of a word, which word is used in
    some contexts as a preposition, whether or not said instance is
    itself used as a preposition.
quality - The proportion of candidate terms produced by the procedure
    that are actually technical terms.
stoplist - A list of tokens which are assumed not to occur in
    technical terminology. Typical members of a stoplist are
    (1) frequency words which are not (or are only rarely) nouns,
    adjectives, or prepositions, (2) nouns which are not full nouns
    and adjectives which are not full adjectives, and (3) tokens
    which occur in [illegible] which are not words.
string - A sequence of tokens.
substring - A word string which is contained within a [illegible].
    Substrings are submitted to a frequency constraint and
    grammatical constraints in order to determine if they will be
    included in the set of candidate terms.
technical term - A sequence of one or more words that has a
    particular meaning in some technical domain.
terminology - The set of technical terms in a text from a particular
    technical domain.
text file - A computer file containing natural language [illegible].
token - A sequence of text characters delimited by blanks and/or
    punctuation.
token constraint - A condition, on substrings, that every token in
    the substring contains only characters that are allowable
    characters, and that no token in the substring appears on the
    stoplist.
word - A token considered to be part of a natural language.
    Punctuation, numbers, mathematical expressions, etc. are not
    words.
word string - A sequence of one or more words.

Having thus described our invention, what we claim as new and desire
to secure by Letters Patent is:

1. Programmed computer apparatus for extracting a list of candidate
multi-word technical terms from an input text file, a multi-word
technical term being a string of at least two words having a
particular meaning in some technical field, said apparatus
comprising:

means for storing a stoplist of tokens which are assumed to not occur
in multi-word technical terms, a token being a word, character or
string of characters delimited by blanks and/or punctuation;

means for storing a maximum length parameter specifying a maximum
number of tokens in any candidate multi-word technical term to be
extracted;

means responsive to the stored stoplist for extracting text fragments
from an input text file by identifying delimiting tokens in the input
text file, including means for identifying as a delimiting token each
token in the input text file which is the same as a
token in the stored stoplist, the identified delimiting
tokens defining text fragments therebetween;

means for deriving from the extracted text fragments
all possible subsequences of tokens having a length
of at least two tokens and not more than a maximum
number of tokens specified by the stored
maximum length parameter;

means for testing each of the derived subsequences
against at least one filtering condition; and

means for creating a sublist of the derived subsequences
which pass the at least one filtering condition,
the created sublist being the list of candidate
multi-word technical terms.

2. Programmed computer apparatus as defined in
claim 1 and further comprising means for storing a list
of allowable characters for candidate multi-word technical
terms to be extracted, said means for extracting
text fragments including means for identifying as a delimiting
token each token in the input text file which
contains at least one character not in the stored list of
allowable characters.

3. Programmed computer apparatus as defined in
claim 2 wherein said means for testing comprises:

means for storing a frequency parameter specifying a
minimum frequency of occurrence for a candidate
multi-word technical term to be extracted; and

means for using as a filtering condition that a derived
subsequence being tested occurs in the input text
file with a frequency not less than specified by the
stored frequency parameter.

4. Programmed computer apparatus as defined in 
claim 3 wherein a filtering condition against which each 
of the derived subsequences is tested by said testing 
means is a grammatical constraint. 

5. Programmed computer apparatus as defined in 
claim 4 and further comprising a dictionary of words 
for identifying potential parts of speech for each word 
in the dictionary, any word not in said dictionary being 
considered a potential full noun, and wherein said gram- 
matical constraint is that words at particular positions in 
a derived subsequence being tested potentially are spec- 
ified parts of speech. 

6. Programmed computer apparatus as defined in 
claim 5 wherein said grammatical constraint is that a 
derived subsequence being tested satisfies the following 
conditions: 

the last word is a potential full noun; 
the first word is a potential full noun or potential full 
adjective; 

any intermediate word is a potential full noun, a 
potential full adjective or a potential preposition; 
and 

any intermediate word that is a potential preposition 
and not also either a potential full noun or a poten- 
tial full adjective is preceded by a word that is a 
potential full noun. 
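
The four conditions of claim 6 can be checked position by position. The following sketch assumes the dictionary is a mapping from each word to a set of potential part-of-speech labels; the label strings and function names are assumptions of this sketch.

```python
NOUN, ADJ, PREP = "noun", "adjective", "preposition"

def potential_tags(word, dictionary):
    """Potential parts of speech; unknown words count as potential full nouns."""
    return dictionary.get(word, {NOUN})

def satisfies_constraint(words, dictionary):
    """Apply the four conditions to a subsequence of two or more words."""
    tags = [potential_tags(w, dictionary) for w in words]
    if NOUN not in tags[-1]:              # last word: potential full noun
        return False
    if not tags[0] & {NOUN, ADJ}:         # first word: noun or adjective
        return False
    for i in range(1, len(words) - 1):    # intermediate words only
        if not tags[i] & {NOUN, ADJ, PREP}:
            return False
        pure_preposition = PREP in tags[i] and not tags[i] & {NOUN, ADJ}
        if pure_preposition and NOUN not in tags[i - 1]:
            return False
    return True
```

Because a word may carry several potential parts of speech, each test asks only whether an acceptable label is among the possibilities, not which label applies in context.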

7. Programmed computer apparatus as defined in 
claim 1 wherein said means for testing comprises: 

a dictionary of words for identifying potential parts 
of speech for each word in the dictionary, any 
word not in said dictionary being considered a 
potential noun; and 

means responsive to said dictionary for using as a 
filtering condition whether a derived subsequence 
being tested includes a potential noun as a last to- 
ken. 

8. Programmed computer apparatus as defined in 
claim 1 wherein a filtering condition against which each 
of the derived subsequences is tested by said testing 
means is a grammatical constraint. 

9. Programmed computer apparatus as defined in 
claim 8 and further comprising a dictionary of words 
for identifying potential parts of speech for each word 
in the dictionary, any word not in said dictionary being 
considered a potential full noun, and wherein said gram- 
matical constraint is that words at particular positions in 
a derived subsequence being tested potentially are spec- 
ified parts of speech. 

10. Programmed computer apparatus as defined in 
claim 9 wherein said grammatical constraint is that a 
derived subsequence being tested satisfies the following 
conditions: 

the last word is a potential full noun; 
the first word is a potential full noun or potential full 
adjective; 

any intermediate word is a potential full noun, a poten- 
tial full adjective or a potential preposition; and 

any intermediate word that is a potential preposition 
and not also either a potential full noun or a poten- 
tial full adjective is preceded by a word that is a 
potential full noun. 

11. Programmed computer apparatus for extracting a 
list of candidate multi-word technical terms from an 
input text file, a multi-word technical term being a 
string of at least two words having a particular meaning 
in some technical field, said apparatus comprising: 

means for storing a stoplist of tokens which are as- 
sumed to not occur in multi-word technical terms, 
a token being a word, character or string of charac- 
ters delimited by blanks and/or punctuation; 

means for storing a frequency parameter specifying a 
minimum frequency of occurrence for a candidate 
multi-word technical term to be extracted; 

means responsive to the stored stoplist for extracting 
text fragments from an input text file by identifying 
delimiting tokens in the input text file, including 
means for identifying as a delimiting token each 
token in the input text file which is the same as a 
token in the stored stoplist, the identified delimiting 
tokens defining text fragments therebetween; 

means for deriving from the extracted text fragments 
each possible subsequence of tokens having a 
length of at least two tokens and which occurs in 
the input text file with a frequency not less than 
specified by the stored frequency parameter; 

means for testing each of the derived subsequences 
against at least one filtering condition; and 

means for creating a sublist of the derived subse- 
quences which pass the at least one filtering condi- 
tion, the created sublist being the list of candidate 
multi-word technical terms. 

12. Programmed computer apparatus as defined in 
claim 11 and further comprising means for storing a list 
of allowable characters for candidate multi-word tech- 
nical terms to be extracted, said means for extracting 
text fragments including means for identifying as a de- 
limiting token each token in the input text file which 
contains at least one character not in the stored list of 
allowable characters. 

13. Programmed computer apparatus as defined in 
claim 11 wherein said means for testing comprises: 

means for storing a maximum length parameter speci- 
fying a maximum number of tokens in any candidate 
multi-word technical term to be extracted; and 

means for using as a filtering condition whether a 
derived subsequence being tested has a length 
which does not exceed the length specified by the 
stored maximum length parameter. 

14. Programmed computer apparatus as defined in 
claim 11 wherein a filtering condition against which 
each of the derived subsequences is tested by said 
testing means is a grammatical constraint. 

15. Programmed computer apparatus as defined in 
claim 14 and further comprising a dictionary of words 
for identifying potential parts of speech for each word 
in the dictionary, any word not in said dictionary being 
considered a potential full noun, and wherein said 
grammatical constraint is that words at particular 
positions in a derived subsequence being tested 
potentially are specified parts of speech. 

16. Programmed computer apparatus as defined in 
claim 15 wherein said grammatical constraint is that a 
derived subsequence being tested satisfies the following 
conditions: 

the last word is a potential full noun; 
the first word is a potential full noun or potential full 
adjective; 

any intermediate word is a potential full noun, a 
potential full adjective or a potential preposition; and 

any intermediate word that is a potential preposition 
and not also either a potential full noun or a potential 
full adjective is preceded by a word that is a 
potential full noun. 

17. A computer implemented method of extracting a 
list of candidate multi-word technical terms from an 
input text file, a multi-word technical term being a 
string of at least two words having a particular meaning 
in some technical field, said method comprising the 
computer implemented steps of: 

storing a stoplist of tokens which are assumed to not 
occur in multi-word technical terms, a token being 
a word, character or string of characters delimited 
by blanks and/or punctuation; 

storing a maximum length parameter specifying a 
maximum number of tokens in any candidate 
multi-word technical term to be extracted; 

extracting text fragments from an input text file by 
identifying delimiting tokens in the input text file at 
least in part by identifying as a delimiting token 
each token in the input text file which is the same as 
a token in the stored stoplist, the identified 
delimiting tokens defining text fragments therebetween; 

deriving from the extracted text fragments all possible 
subsequences of tokens having a length of at 
least two tokens and no more than a maximum 
number of tokens specified by the stored maximum 
length parameter; 

testing each of the derived subsequences against at 
least one filtering condition; and 

creating a sublist of the derived subsequences which 
pass the at least one filtering condition, the created 
sublist being the list of candidate multi-word 
technical terms. 

18. A computer implemented method as defined in 
claim 17 and further comprising the steps of storing a 
list of allowable characters for candidate multi-word 
technical terms to be extracted, and in the step of 
extracting text fragments, identifying as a delimiting token 
each token in the input text file which contains at least 
one character not in the stored list of allowable 
characters. 

19. A computer implemented method as defined in 
claim 18 wherein the testing step further comprises the 
steps of: 

storing a frequency parameter specifying a minimum 
frequency of occurrence for a candidate multi-word 
technical term to be extracted; and 

using as a filtering condition in the testing step 
whether a derived subsequence being tested occurs 
in the input text file with a frequency not less than 
specified by the stored frequency parameter. 

20. A computer implemented method as defined in 
claim 17 wherein the testing step further comprises the 
steps of: 

storing a frequency parameter specifying a minimum 
frequency of occurrence for a candidate multi-word 
technical term to be extracted; and 

using as a filtering condition in the testing step 
whether a derived subsequence being tested occurs 
in the input text file with a frequency not less than 
specified by the stored frequency parameter. 

21. A computer implemented method as defined in 
claim 17 wherein the testing step includes the step of 
testing each of the derived subsequences against a 
grammatical constraint filtering condition. 

22. A computer implemented method as defined in 
claim 21 and further comprising the step of storing a 
dictionary of words identifying potential parts of speech 
for each word in the dictionary, any word not in said 
dictionary being considered a potential full noun, and 
wherein the step of testing each derived subsequence 
against a grammatical constraint includes the step of 
determining whether words at particular positions in a 
derived subsequence being tested potentially are 
specified parts of speech. 

23. Apparatus as defined in claim 22 wherein said step 
of testing each derived subsequence against a 
grammatical constraint includes the step of determining whether 
a derived subsequence being tested satisfies the 
following conditions: 

the last word is a potential full noun; 
the first word is a potential full noun or potential full 
adjective; 

any intermediate word is a potential full noun, a 
potential full adjective or a potential preposition; and 

any intermediate word that is a potential preposition 
and not also either a potential full noun or a potential 
full adjective is preceded by a word that is a 
potential full noun. 

24. A computer implemented method as defined in 
claim [illegible] wherein the testing step further comprises the 
steps of: 

storing a dictionary identifying potential parts of 
speech for each word in the dictionary, any word 
not in said dictionary being considered a potential 
noun; and 

using as a filtering condition whether a derived 
subsequence being tested includes a potential noun as a 
last token by finding potential parts of speech for 
[illegible] token in the stored dictionary. 

25. A computer implemented method of extracting a 
list of candidate multi-word technical terms from an 
input text file, a multi-word technical term being a 
string of at least two words having a particular meaning 
in some technical field, said method comprising the 
computer implemented steps of: 

storing a stoplist of tokens which are assumed to not 
occur in multi-word technical terms, a token being 
a word, character or string of characters delimited 
by blanks and/or punctuation; 

storing a frequency parameter specifying a minimum 
frequency of occurrence for a candidate multi-word 
technical term to be extracted; 

extracting text fragments from an input text file by 
identifying delimiting tokens in the input text file at 
least in part by identifying as a delimiting token 
each token in the input text file which is the same as 
a token in the stored stoplist, the identified 
delimiting tokens defining text fragments therebetween; 

deriving from the extracted text fragments all possible 
subsequences of tokens having a length of at 
least two tokens and which occur in the input text 
file with a frequency not less than specified by the 
stored frequency parameter; 

testing each of the derived subsequences against at 
least one filtering condition; and 

creating a sublist of the derived subsequences which 
pass the at least one filtering condition, the created 
sublist being the list of candidate multi-word 
technical terms. 

26. A computer implemented method as defined in 
claim [illegible] and further comprising the steps of storing a 
list of allowable characters for candidate [remainder illegible] 

27. A computer implemented method as defined in 
claim 26 wherein the testing step includes the step of 
testing each of the derived subsequences against a 
grammatical constraint filtering condition. 

28. A computer implemented method as defined in 
claim [illegible] wherein the testing step includes the step of 
testing each of the derived subsequences against a 
grammatical constraint filtering condition. 